# instructions for assembly

The main code directory of the paper contains all Stata code used to process the data. For information on the subdirectories ALM, Measures of wages, montecarlo and macrosim, and Timmer, please consult the readme.md file in the respective subdirectory. The subdirectory tabfig_construction contains files that build the aggregated datasets for some figures and tables; they are kept separate because they need proprietary data.

The assembly directory also includes the patstat and classification subdirectories, which constitute the first two parts and can be run separately from the rest of the assembly.

## more details on specific files, in order of execution

### Patstat
-   Patstat_2018b_tls231.do | Imports the tls 231 code files of Patstat. Please make sure they are unzipped (i.e. as txt files) in the /datasets/common_data/patstat_raw folder. Selects only the patents from the EPO national phase and saves them.

-   Patstat_2018b.do | Imports and cleans all other tls codes, again as txt files in the /datasets/common_data/patstat_raw folder. Creates full panels of applicants and citations. The following codes are required:
    - 201 | 204 | 207 | 209
    - 211 | 212 | 216
    - 224 | 228 | 230
    - 801 | 901 | 906
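
Before running the Patstat step, it can help to check that every required table is present as an unzipped txt file. The following is a minimal Python sketch, not part of the package. The function name `missing_patstat_tables` and the file naming pattern `tls<code>*.txt` are assumptions; adjust the pattern to match how your Patstat files are actually named.

```python
from pathlib import Path

# Table codes listed above as required by Patstat_2018b.do.
REQUIRED_CODES = ["201", "204", "207", "209", "211", "212", "216",
                  "224", "228", "230", "801", "901", "906"]


def missing_patstat_tables(raw_dir, codes=REQUIRED_CODES):
    """Return the codes for which no matching txt file exists in raw_dir.

    Assumes (hypothetically) that files are named like tls201_part01.txt
    or tls201.txt; this pattern is not taken from the package itself.
    """
    raw = Path(raw_dir)
    missing = []
    for code in codes:
        # any() is False when the glob yields no file for this code.
        if not any(raw.glob(f"tls{code}*.txt")):
            missing.append(code)
    return missing
```

Calling `missing_patstat_tables("datasets/common_data/patstat_raw")` returns an empty list when all thirteen tables are found; pass `["231"]` as `codes` to check the table needed by Patstat_2018b_tls231.do.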



### Classification
-   biadic_families.do | flags patent families that are applied for in multiple countries
-   ipc_cpc_codes.do | makes a combined ipc/cpc (cipc) code and maps to applications
-   docdb_families_ipc_codes.do | maps patent families to the combined cipc codes
-   appln_ipc.do | maps cipc to applications in lists that Python created
-   adjusted_citations.do | normalizes citations by technological field
-   fields.do | matches technical fields to patent families
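
To illustrate what normalizing citations by technological field can mean, here is a minimal Python sketch that divides each patent's citation count by the mean count within its field. This is one common normalization and only an assumption about the approach; adjusted_citations.do may normalize differently (for example, additionally by year). The function name and input layout are hypothetical.

```python
from collections import defaultdict


def normalize_by_field(citations, fields):
    """Divide each patent's citation count by the mean count in its field.

    citations: dict mapping patent id -> raw citation count
    fields:    dict mapping patent id -> technological field label
    Both layouts are illustrative assumptions, not the package's format.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for patent_id, count in citations.items():
        field = fields[patent_id]
        totals[field] += count
        counts[field] += 1

    means = {field: totals[field] / counts[field] for field in totals}

    normalized = {}
    for patent_id, count in citations.items():
        mean = means[fields[patent_id]]
        # Fields without any citations get 0.0 to avoid division by zero.
        normalized[patent_id] = count / mean if mean > 0 else 0.0
    return normalized
```

With this scheme, a value above 1 marks a patent that is cited more often than the average patent in the same field.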

### Measures of wages
- Prepares different kinds of data, including wages, GDP etc. See the subdirectory Measures of wages for details.

### Timmer
- Includes versions of their files adapted for our purposes. See the subdirectory for details.

### Orbis (not a separate directory, but executed separately)
- orbis_patents.do | Collects Orbis companies, names and NACE codes for later use
- companies.py | Normalizes Orbis firm names and collects subfirms (e.g. Siemens health, Siemens energy -> Siemens)
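
The kind of name normalization and subfirm grouping described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the logic of companies.py: the legal-form list, the function names, and the prefix-matching rule against a set of known parent names are all hypothetical.

```python
import re

# Illustrative, incomplete list of legal-form tokens to drop (assumption).
LEGAL_FORMS = {"AG", "GMBH", "SE", "INC", "LTD", "LLC", "NV", "BV", "PLC", "CORP"}


def normalize_name(name):
    """Uppercase, replace punctuation with spaces, and drop legal-form tokens."""
    cleaned = re.sub(r"[^\w\s]", " ", name.upper())
    tokens = [tok for tok in cleaned.split() if tok not in LEGAL_FORMS]
    return " ".join(tokens)


def parent_name(name, known_parents):
    """Map a firm name to a known parent if the normalized name starts with it.

    known_parents is a set of already normalized parent names. Longer
    parents are tried first so the most specific match wins.
    """
    norm = normalize_name(name)
    for parent in sorted(known_parents, key=len, reverse=True):
        if norm == parent or norm.startswith(parent + " "):
            return parent
    return norm
```

For example, `parent_name("Siemens Energy AG", {"SIEMENS"})` yields `"SIEMENS"`, while a name with no matching parent is returned in its normalized form.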

### Assembly
-   MP_import.do | Imports Mann-Puettmann 2021 patent data | Requires patents1.csv and patents2.csv from their GitHub repository to run, otherwise skipped. Place them in the ./datasets/common_data/mann_puettmann folder. | see sources.md
-   merge_firms.do | matches firms and their patent applications into a single file. Requires Orbis data, skipped otherwise. | see sources.md
-   BvD_industry.do | matches firms to NACE industries | Requires Orbis data, skipped otherwise. Resulting files are provided. | see sources.md
-   BvD_groups.do | groups firms by number of patents, used only for table A4a | Requires Orbis data, skipped otherwise. Resulting files are provided. | see sources.md
-   import_classification.do | imports the patent application files created by Python and restricts them to machinery codes | Requires Patstat data. Skipped if not present. Resulting files are included. | see sources.md
-   import_indep_vars | Imports most raw RHS variables from the "Measures of Wages" package by sector (total or manufacturing). No normalization at this point, just renaming, reshaping etc. Variables imported here are corrected for inflation and exchange rates; wages are already split by skill level.
-   refined_placebos.do | builds patent lists of refined placebos (see file for details) | Requires Patstat data. Skipped if not present. Resulting files are provided.
-   family_timeseries.do | builds an overview of all patent families and their attributes (biadic, machinery, patent authority etc.) | Requires patstat data and is skipped if not present. Resulting files are provided. | see sources.md
-   build_patlevel_classification_stats.do | builds patent-level classification statistics, as the name says. | Requires Patstat data. Skipped if not present. Resulting files are provided. | see sources.md
-   weights.do | creates patent weights, run multiple times with different windows. | Requires patstat and Orbis data for some versions. those versions are skipped if data is not present. Resulting files provided. | see sources.md
-   spillover_weights.do | creates the weights used to assign a firm the spillovers it receives from the countrystocks. | Requires Patstat and Orbis data. Skipped if not present. Resulting files are provided. | see sources.md
-   make_inventorcountry.do | uses spillover weights (inventorweights) instead of patent weights to define the home country.
-   dep_vars.do | builds a firm's patent counts and, from those, the patent stocks at the firm-year level. | Requires Patstat and Orbis data. Skipped if not present. Resulting files are provided. | see sources.md
-   bvd_year_lists.do | creates lists of firm-year combinations based on depvars and weights that are used by other files
-   spillover_stocks.do | creates the countrystocks of patents that are distributed to firms based on weights. | Requires Patstat data and is skipped if not present. Resulting files are provided. | see sources.md
-   spillovers.do | brings together the output of spillover_weights.do and spillover_stocks.do and does the actual distribution of the countrystocks to the firms
-   make_indep_vars.do | first part of normalization and merging together the different RHS variables, run multiple times with different windows
-   make_final_dataset.do | creates the panel from which the regression is run by merging all LHS and RHS datasets and normalizing home and foreign values, run multiple times with different windows
-   regression_firms.do | saves the identifiers of all firms that fulfill the criteria for the baseline regression (i.e. no missing weights etc)
-   df_totlsw.do | similar to make_final_dataset.do, but for total low-skilled wage compensation instead of the low-skilled wage. It uses different variables rather than a different window, hence the separate file
-   df_ovb.do | builds the offshoring and recent innovations RHS variables as well as the stocks and spillovers for the latter. | Requires Patstat data and is skipped if not present. Resulting files are provided. | see sources.md
-   df_inventorweighted_wages.do | builds the wages firms are exposed to that are weighted by inventors for table A33
-   df_longdiff.do | creates a dataset for the five-year difference estimation by summing up within the five-year windows and calculating the difference to the neighbouring window
-   df_predicted.do | equivalent to make_indep_vars.do and make_final_dataset.do but using predicted RHS variables.
-   build_bhj_data.do | creates a dataset usable in a linear regression setting following Borusyak, Hull and Jaravel, 2022
-   build_mp_comparison.do | Builds a dataset for the comparison of our classification with Mann & Puettmann 2021. Requires Mann-Puettmann data (see MP_import.do) and Patstat data. Skipped if not available. The resulting file is provided. | see sources.md
-   sample_descriptives.do | makes statistics on the firms, applications and patents present in the baseline regression. | requires Patstat and Orbis data and is skipped if not present. Resulting file is provided. | see sources.md
-   make_indepvars_hinvt_hq.do | equivalent to make_indep_vars.do, but with a different home country
-   make_final_dataset_hinvt_hq.do | equivalent to make_final_dataset.do, again with a different home country
-   df_predictingweights.do | builds weights by predicting based on wages.
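
The five-year difference built by df_longdiff.do (sum within each five-year window, then take the difference to the neighbouring window) can be pictured with the following minimal Python sketch. It is an illustration under assumptions, not a translation of the Stata file: the function name, the input layout, and the choice to difference each window against the preceding one are hypothetical.

```python
def five_year_differences(yearly, start_year, window=5):
    """Sum yearly values within consecutive windows and difference neighbours.

    yearly:     dict mapping year -> value for one firm and one variable
    start_year: first year of the first window (years before it are ignored)
    Returns a dict mapping each window's first year to
    (sum in that window) - (sum in the preceding window).
    """
    sums = {}
    for year, value in yearly.items():
        if year < start_year:
            continue
        index = (year - start_year) // window
        sums[index] = sums.get(index, 0) + value

    diffs = {}
    for index in sorted(sums):
        # Only windows with an observed predecessor get a difference.
        if index - 1 in sums:
            diffs[start_year + index * window] = sums[index] - sums[index - 1]
    return diffs
```

For yearly values 0, 1, ..., 9 over 2000 to 2009, the window sums are 10 and 35, so the sketch returns a single difference of 25 keyed by 2005.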

### tabfig construction

Contains code that assembles datasets that are either used only in tables and figures or require proprietary data.
-   nace_industry.do | creates a file used only for table A2. Aggregates NACE industries to the top 2 levels.
-   fig_A1_data.do | Aggregates patents across industries. Requires Patstat data. Skipped if not present.
-   tab_A24_data.do | Just a crosswalk from applications to patent families for the Mann-Puettmann comparison. Uses a Patstat file and is skipped if Patstat is not present.
-   tab_A25_data.do | Same requirement as for tab_A24_data.do.
-   tab_A41_data.do | Citation weighting; also needs Patstat data. Skipped if not present.